Soccer: is scoring goals a predictable Poissonian process? 

The non-scientific event of a soccer match is analysed on a strictly scientific level. The analysis 
is based on the recently introduced concept of a team fitness (Eur. Phys. J. B 67, 445, 2009) and 
requires the use of finite-size scaling. A uniquely defined function is derived which quantitatively 
predicts the expected average outcome of a soccer match in terms of the fitness of both teams. It 
is checked whether temporary fitness fluctuations of a team hamper the predictability of a soccer 
match. To a very good approximation scoring goals during a match can be characterized as in- 
dependent Poissonian processes with pre-determined expectation values. Minor correlations give 
rise to an increase of the number of draws. The non-Poissonian overall goal distribution is just a 
consequence of the fitness distribution among different teams. The limits of predictability of soccer 
matches are quantified. Our model-free classification of the underlying ingredients determining the 
outcome of soccer matches can be generalized to different types of sports events. 

In recent years different approaches, originating from the physics community, have shed new light on sports events, e.g. by studying the behavior of spectators, by elucidating the statistical vs. systematic features behind league tables [2]-[4], by studying the temporal sequence of ball movements [?] or using extreme value statistics [?] known, e.g., from finance analysis. For the specific case of soccer matches different models have been introduced on phenomenological grounds [?]-[14]. However, very basic questions related, e.g., to the relevance of systematic vs. statistical contributions or the temporal fitness evolution are still open. It is known that the distribution of soccer goals is broader than a Poissonian distribution [?]. This observation has been attributed to the presence of self-affirmative effects during a soccer match [15], [16], i.e. an increased probability to score a goal depending on the number of goals already scored by that team.

In this work we introduce a general model-free approach which allows us to elucidate the outcome of sports events. Combining strict mathematical reasoning, appropriate finite-size scaling and comparison with actual data, all ingredients of this framework can be quantified for the specific example of soccer. A unique relation can be derived to calculate the expected outcome of a soccer match and three hierarchical levels of statistical influence can be identified. As one application we show that the skewness of the distribution of soccer goals [?] can be fully related to fitness variations among different teams and does not require the presence of self-affirmative effects.

As a data basis we take all matches in the German Bundesliga (www.bundesliga-statistik.de) between seasons 1987/88 and 2007/08 except for the year 1991/92 (in that year the league contained 20 teams). Every team plays 34 matches per season. Earlier seasons are not taken into account because the underlying statistical properties (in particular the number of goals per match) are somewhat different.

Conceptually, our analysis relies on recent observations in describing soccer leagues [17]: (i) The home advantage is characterized by a team-independent but season-dependent increase of the home team goal difference $c_{\mathrm{home}} > 0$. (ii) An appropriate observable to characterize the fitness of a team $i$ in a given season is the average goal difference (normalized per match) $\Delta G_i(N)$, i.e. the difference of the goals scored and conceded during $N$ matches. In particular it contains more information about the team fitness than, e.g., the number of points.

FIG. 1: The correlation function $h(t)$. The average value of $h(t)$ is included (excluding the value for $t = 17$) yielding approx. 0.22 [17].

Straightforward information about the team behavior during a season can be extracted from correlating its match results from different match days. Formally, this is expressed by the correlation function $h(t) = \langle \Delta g_{ij}(t_0)\, \Delta g_{ik}(t_0 + t) \rangle$. Here $\Delta g_{ij} := g_i - g_j$ denotes the goal difference of a match of team $i$ vs. team $j$ with the final result $g_i : g_j$; $j$ and $k$ are the opponents of team $i$ at match days $t_0$ and $t_0 + t$. The home-away asymmetry can be taken into account by the transformation $\Delta g_{ij} \to \Delta g_{ij} \mp c_{\mathrm{home}}$, where the sign depends on whether team $i$ plays at home or away. The resulting function $h(t)$ is shown in Fig. 1. Apart from the data point for $t = 17$ one observes a time-independent positive plateau value. The absolute value of this constant corresponds to the variance $\sigma^2_{\Delta G}$ of $\Delta G_i$ and is thus a measure for the fitness variation in a league [17]. Furthermore, the lack of any decay shows that the fitness of a team is constant during the whole season. This result is fully consistent with the finite-size scaling analysis in Ref. [17], where additionally the fitness change between two seasons was quantified. The exception for $t = 17$ just reflects the fact that team $i$ is playing against the same team at days $t_0$ and $t_0 + 17$, yielding additional correlations between the outcome of both matches (see also below).
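
As a minimal sketch of how $h(t)$ can be evaluated, the following Python function averages the product of two home-corrected goal differences of the same team taken $t$ match days apart. The data layout (a dictionary mapping each team to its list of goal differences per match day), the name `correlation_h` and the toy numbers are assumptions made for illustration; they are not taken from the Bundesliga data.

```python
from typing import Dict, List


def correlation_h(goal_diffs: Dict[str, List[float]], t: int) -> float:
    """Average of dg_i(t0) * dg_i(t0 + t) over all teams i and match days t0.

    goal_diffs[team][day] is the goal difference of `team` on match day `day`,
    already corrected for the home advantage (-c_home at home, +c_home away).
    """
    products = []
    for diffs in goal_diffs.values():
        for t0 in range(len(diffs) - t):
            products.append(diffs[t0] * diffs[t0 + t])
    if not products:
        raise ValueError("time lag t is too large for the given data")
    return sum(products) / len(products)


# Toy input (hypothetical numbers: two teams, three match days).
toy = {"A": [1.0, -1.0, 2.0], "B": [0.0, 1.0, 1.0]}
h1 = correlation_h(toy, 1)
h2 = correlation_h(toy, 2)
```

With real data one would evaluate the function for every lag $t$ and look for the plateau and the exceptional point at $t = 17$.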

As an immediate consequence, the limit of $\Delta G_i(N)$ for large $N$, corresponding to the true fitness $\Delta G_i$, is well-defined. A consistent estimator for $\Delta G_i$, based on the information from a finite number of matches, reads

$$\Delta G_i = a_N\, \Delta G_i(N) \tag{1}$$

with $a_N \approx 1/[1 + 3/(N \sigma^2_{\Delta G})]$ [17]. For large $N$ the factor $a_N$ approaches unity and the estimation becomes error-free, i.e. $\Delta G_i(N) \to \Delta G_i$. For $N = 33$ one has $a_N \approx 0.71$ and the variance of the estimation error is given by $\sigma^2_{e,N} = (N/3 + 1/\sigma^2_{\Delta G})^{-1} \approx 0.06$ [17]. This statistical framework is known as regression toward the mean [18]. Analogously, introducing $\Sigma G_i(N)$ as the average sum of goals scored and conceded by team $i$ in $N$ matches, its long-time limit is estimated via $\Sigma G_i - \lambda = b_N (\Sigma G_i(N) - \lambda)$, where $\lambda$ is the average number of goals per match in the respective season. Using $\sigma^2_{\Sigma G} \approx 0.035$ one correspondingly obtains $b_{N=33} = 0.28$.
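
The numbers quoted above can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the expression used for $b_N$ (the same form as $a_N$ with the fitness variance replaced by the variance of the goal sum, about 0.035) is an assumption that is not spelled out in the text, although it does reproduce the quoted value of 0.28.

```python
def shrinkage_factor(n_matches, variance_fitness, variance_noise=3.0):
    """Regression-toward-the-mean factor 1 / (1 + variance_noise / (N * variance_fitness))."""
    return 1.0 / (1.0 + variance_noise / (n_matches * variance_fitness))


def estimation_error_variance(n_matches, variance_fitness, variance_noise=3.0):
    """Variance of the estimation error, (N / variance_noise + 1 / variance_fitness)^(-1)."""
    return 1.0 / (n_matches / variance_noise + 1.0 / variance_fitness)


a_33 = shrinkage_factor(33, 0.22)              # goal difference, variance 0.22
err_33 = estimation_error_variance(33, 0.22)   # error variance of the fitness estimate
b_33 = shrinkage_factor(33, 0.035)             # goal sum (assumed analogous form)

# Example: an observed average goal difference of +1.0 over 33 matches
# is shrunk towards zero when estimating the true fitness via Eq. (1).
estimated_fitness = a_33 * 1.0
```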

Our key goal is to find a sound characterization of the match result when team $i$ is playing vs. team $j$, i.e. $\Delta g_{ij}$ or even $g_i$ and $g_j$ individually. The final outcome $\Delta g_{ij}$ has three conceptually different and uncorrelated contributions

$$\Delta g_{ij} = q_{ij} + f_{ij} + r_{ij}. \tag{2}$$
Averaging over all matches one can define the respective variances $\sigma_q^2$, $\sigma_f^2$ and $\sigma_r^2$. (1) $q_{ij}$ expresses the average outcome which can be expected based on knowledge of the team fitness values $\Delta G_i$ and $\Delta G_j$, respectively. Conceptually this can be determined by averaging over all matches when teams with these fitness values play against each other. The task is to determine the dependence of $q_{ij} = q(\Delta G_i, \Delta G_j)$ on $\Delta G_i$ and $\Delta G_j$. (2) For a specific match, however, the outcome can be systematically influenced by different factors beyond the general fitness values, described by the variable $f_{ij}$ with a mean of zero: (a) External effects such as several players who are injured or tired, weather conditions (helping one team more than the other), or red cards. As a consequence the effective fitness of a team relevant for this match may differ from the estimation $\Delta G_i$ (or $\Delta G_j$). (b) Intra-match effects depending on the actual course of a match. One example is the suggested presence of self-affirmative effects, i.e. an increased probability to score a goal (equivalently an increased fitness) depending on the number of goals already scored by that team [15], [16]. Naturally, $f_{ij}$ is much harder to predict, if possible at all. Here we restrict ourselves to the estimation of its relevance via determination of $\sigma_f^2$. (3) Finally, one has to understand the emergence of the actual goal distribution based on expectation values, as expressed by the random variable $r_{ij}$ with average zero. This problem is similar to the physical problem when a decay rate (here corresponding to $q_{ij} + f_{ij}$) has to be translated into the actual number of decay processes.

Determination of $q_{ij}$: $q_{ij}$ has to fulfill the two basic conditions (taking into account the home advantage): $q_{ij} - c_{\mathrm{home}} = -(q_{ji} - c_{\mathrm{home}})$ (symmetry condition) and $\langle q_{ij} \rangle_j - c_{\mathrm{home}} = \Delta G_i$ (consistency condition), where the average is over all teams $j \neq i$ (in the second condition a minor correction due to the finite number of teams in a league is neglected). The most general dependence on $\Delta G_{i,j}$ up to third order, which is compatible with both conditions, is given by

$$q_{ij} = c_{\mathrm{home}} + (\Delta G_i - \Delta G_j) \cdot [1 - c_3 (\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)]. \tag{3}$$
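
Both conditions can be verified numerically for Eq. (3). The sketch below uses hypothetical fitness values with zero mean and arbitrary values for $c_{\mathrm{home}}$ and $c_3$, chosen only for illustration. Summing over all teams and dividing by the number of teams $n$ makes the consistency condition exact (the term with $j = i$ vanishes), whereas dividing by $n - 1$ rescales the result by $n/(n-1)$, which is presumably the kind of finite-league correction the text neglects.

```python
def q_expected(dg_i, dg_j, c_home, c3, var_dg):
    """Eq. (3): expected goal difference when team i plays at home against team j."""
    return c_home + (dg_i - dg_j) * (1.0 - c3 * (var_dg + dg_i * dg_j))


# Hypothetical fitness values with zero mean (illustration only).
fitness = [0.9, 0.4, 0.1, -0.2, -0.5, -0.7]
n = len(fitness)
var_dg = sum(g * g for g in fitness) / n   # variance of the fitness values
c_home, c3 = 0.5, 0.3                      # arbitrary illustrative parameters

# Symmetry condition: (q_ij - c_home) + (q_ji - c_home) == 0 for all pairs.
symmetric = all(
    abs((q_expected(a, b, c_home, c3, var_dg) - c_home)
        + (q_expected(b, a, c_home, c3, var_dg) - c_home)) < 1e-12
    for a in fitness for b in fitness
)


def mean_over_opponents(i, divisor):
    """Sum of (q_ij - c_home) over all teams j, divided by `divisor`."""
    total = sum(q_expected(fitness[i], g, c_home, c3, var_dg) - c_home for g in fitness)
    return total / divisor


exact = [mean_over_opponents(i, n) for i in range(n)]        # reproduces dG_i
finite = [mean_over_opponents(i, n - 1) for i in range(n)]   # dG_i * n / (n - 1)
```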

Qualitatively, the $c_3$-term takes into account the possible effect that in case of very different team strengths (e.g. $\Delta G_i \gg 0$ and $\Delta G_j \ll 0$) the expected goal difference is even more pronounced ($c_3 > 0$: too much respect of the weaker team) or reduced ($c_3 < 0$: tendency of presumption of the better team). On a phenomenological level this effect is already considered in the model of, e.g., Ref. [12]. The task is to determine the adjustable parameter $c_3$ from comparison with actual data. We first rewrite Eq. (3) as $q_{ij} - (\Delta G_i - \Delta G_j) - c_{\mathrm{home}} = -c_3 (\Delta G_i - \Delta G_j)(\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)$. In case that $\Delta G_{i,j}$ is known this would correspond to a straightforward regression problem of $\Delta g_{ij} - (\Delta G_i - \Delta G_j) - c_{\mathrm{home}}$ vs. $-(\Delta G_i - \Delta G_j)(\sigma^2_{\Delta G} + \Delta G_i \Delta G_j)$. An optimum estimation of the fitness values for a specific match via Eq. (1) is based on $\Delta G_{i,j}(N)$, calculated from the remaining $N = 33$ matches of both teams in that season. Of course, the resulting value of $c_3(N = 33)$ is still hampered by finite-size effects, in analogy to the regression toward the mean. This problem can be solved by estimating $c_3(N)$ for different values of $N$ and subsequent extrapolation to infinite $N$ in a $1/N$-representation. Then our estimation of $c_3$ is not hampered by the uncertainty in the determination of $\Delta G_{i,j}$. For a fixed $N < 30$ the regression analysis is based on 50 different choices of $\Delta G_{i,j}(N)$, obtained by choosing different subsets of $N$ matches to improve the statistics. The result is shown in Fig. 2. The estimated error results from performing this analysis individually for each season. Due to the strong correlations for different $N$-values the final error is much larger than suggested by the fluctuations among different data points. The data are compatible with $c_3 = 0$. Thus, we have shown that the simple choice

$$q_{ij} = \Delta G_i - \Delta G_j + c_{\mathrm{home}} \tag{4}$$

is the uniquely defined relation (neglecting irrelevant terms of 5th order) to characterize the average outcome of a soccer match. In practice the right side can be estimated via Eq. (1). This result implies that $h(t) = \langle (\Delta G_i - \Delta G_j)(\Delta G_i - \Delta G_k) \rangle = \sigma^2_{\Delta G} + \langle \Delta G_j \Delta G_k \rangle$, i.e. $h(t \neq 17) = \sigma^2_{\Delta G}$ and $h(t = 17) = 2\sigma^2_{\Delta G}$. This agrees very well with the data. Furthermore, the variance of the $q_{ij}$ distribution, i.e. $\sigma_q^2$, is by definition given by $2\sigma^2_{\Delta G} \approx 0.44$.
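
The extrapolation of $c_3(N)$ to infinite $N$ can be sketched as a least-squares straight-line fit against $1/N$, whose intercept is the value for $N \to \infty$. That the dependence is linear in $1/N$ is an assumption of this sketch (the text only states that a $1/N$-representation is used), the function name is hypothetical, and the input values are made up to lie exactly on a line; they are not the values of Fig. 2.

```python
def extrapolate_to_infinite_n(n_values, c3_values):
    """Fit c3(N) = intercept + slope / N by least squares; return (intercept, slope)."""
    x = [1.0 / n for n in n_values]
    m = len(x)
    x_mean = sum(x) / m
    y_mean = sum(c3_values) / m
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, c3_values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Made-up values lying exactly on c3(N) = 0.02 - 0.6 / N (not the paper's data).
n_values = [10, 15, 20, 25, 30]
c3_values = [0.02 - 0.6 / n for n in n_values]
c3_infinite, slope = extrapolate_to_infinite_n(n_values, c3_values)
```

In an actual analysis the points for different $N$ are strongly correlated, so the scatter around the fitted line would underestimate the error of the intercept, as noted in the text.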

FIG. 2: Determination of $c_3$ by finite-size scaling.

Determination of $\sigma_f^2$: The above analysis does not contain any information about the match-specific fitness relative to $\Delta G_i - \Delta G_j$. For example, $f_{ij} > 0$ during a specific match implies that team $i$ plays better than expected from $q_{ij}$. The conceptual problem is to disentangle the possible influence of these fitness fluctuations from the random aspects of a soccer match. The key idea is based on the observation that, e.g., for $f_{ij} > 0$ team $i$ will play better than expected in both the first and the second half of the match. In contrast, the random features of a match do not show this correlation. For the identification of $\sigma_f^2$ one defines $A = \langle (\Delta g^{(1)}_{ij}/b_1 - c_{\mathrm{home}}) \cdot (\Delta g^{(2)}_{ij}/b_2 - c_{\mathrm{home}}) \rangle_{ij}$, where $\Delta g^{(1),(2)}_{ij}$ is the goal difference in the first and second half in the specific match, respectively, and $b_{1,2}$ the fraction of goals scored during the first and the second half, respectively ($b_1 = 0.45$; $b_2 = 0.55$). Based on Eq. (4) one has $\sigma_f^2 \approx A - 2\sigma^2_{\Delta G}$. Actually, to improve the statistics we have additionally used different partitions of the match (e.g. first and third quarter vs. second and fourth quarter). Numerical evaluation yields $\sigma_f^2 = -0.04 \pm 0.06$, where the error bar is estimated from individual averaging over the different seasons. Thus one obtains in particular $\sigma_f^2 \ll \sigma_q^2$, which renders match-specific fitness fluctuations irrelevant. Actually, as shown in [17], one can observe a tendency that teams which have lost 4 times in a row tend to play worse in the near future than expected by their fitness. Strictly speaking these streaks indeed reflect minor temporary fitness variations. However, the number of streaks is very small (less than 10 per season) and, furthermore, mostly of statistical nature. The same holds for red cards, which naturally influence the fitness but fortunately are quite rare. Thus, these extreme events are interesting in their own right but are not relevant for the overall statistical description. The negative value of $\sigma_f^2$ points towards anti-correlations between both partitions of the match. A possible reason is the observed tendency towards a draw, as outlined below.
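
The logic of the half-time estimator can be illustrated with a small Monte Carlo sketch in which no match-specific fitness fluctuations are present, so that the estimate should vanish within statistical noise. All modelling choices here are assumptions made for illustration: Gaussian fitness values with variance 0.22, the same expected goal sum of 3 for every match, $c_{\mathrm{home}} = 0.5$, and independent Poisson goals in each half. Instead of the fixed value $2\sigma^2_{\Delta G}$ the sketch subtracts the sample average of $(\Delta G_i - \Delta G_j)^2$, which has the same expectation.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson random number (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def estimate_sigma_f2(n_matches=100_000, seed=1):
    """Half-time correlator minus the fitness contribution, for simulated matches."""
    rng = random.Random(seed)
    sigma_dg = math.sqrt(0.22)   # fitness spread quoted in the text
    lam, c_home = 3.0, 0.5       # goals per match ~ 3 (text); c_home is assumed
    fractions = (0.45, 0.55)     # fraction of goals per half (from the text)
    sum_a = sum_d2 = 0.0
    for _ in range(n_matches):
        d = rng.gauss(0.0, sigma_dg) - rng.gauss(0.0, sigma_dg)  # dG_i - dG_j
        q = d + c_home                       # Eq. (4), with f_ij = 0
        mu_i = max((lam + q) / 2.0, 1e-9)    # expected goals of team i
        mu_j = max((lam - q) / 2.0, 1e-9)    # expected goals of team j
        halves = []
        for frac in fractions:
            dg_half = poisson(rng, frac * mu_i) - poisson(rng, frac * mu_j)
            halves.append(dg_half / frac - c_home)
        sum_a += halves[0] * halves[1]
        sum_d2 += d * d
    return sum_a / n_matches - sum_d2 / n_matches


sigma_f2 = estimate_sigma_f2()
```

Adding a match-specific offset to `q` before the goals are drawn would shift the returned value away from zero by roughly the variance of that offset, which is the effect the estimator is designed to detect.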

FIG. 3: (a) Distribution of goals per team and match and the Poisson prediction if the different fitness values are taken into account (solid line). Furthermore a Poisson estimation is included where only the home-away asymmetry is included (broken line). The quality of the predicted distribution is highlighted in (b) where the ratio of the estimated and the actual probability is shown.

Determination of $r_{ij}$: The actual number of goals $g_{i,j}$ per team and match is shown in Fig. 3. The error bars are estimated based on binomial statistics. As discussed before the distribution is significantly broader than a Poisson distribution, even if separately taken for the home and away goals [?], [16]. Here we show that this distribution can be generated by assuming that goals are scored in independent Poissonian processes. We proceed in three steps. First, we use Eq. (4) to estimate the average goal difference for a specific match with fitness values estimated from the remaining 33 matches of each team. Second, we supplement Eq. (4) by the corresponding estimator for the sum of the goals $g_i + g_j$ given by $\Sigma G_i + \Sigma G_j - \lambda$. Together with Eq. (4) this allows us to calculate the expected number of goals for both teams individually. Third, we generate for both teams a Poissonian distribution based on the corresponding expectation values. The resulting distribution is also shown in Fig. 3 and perfectly agrees with the actual data up to 8 (!) goals. In contrast, if the distribution of fitness values is not taken into account significant deviations are present. Two conclusions can be drawn. First, scoring goals is a highly random process. Second, the good agreement again reflects the fact that $\sigma_f^2$ is small because otherwise an additional broadening of the actual data would be expected. Thus there is no indication of a possible influence of self-affirmative effects during a soccer match [16]. Because of the underlying Poissonian process the value of $\sigma_r^2$ is just given by the average number of goals per match ($\approx 3$).
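
The three steps can be sketched in Python as follows. The expected goal difference and goal sum of a match are converted into the expected goals of each team, and the overall goal distribution is obtained as an average over Poisson distributions with different expectation values. The list of matches is hypothetical and the function names are illustrative; the point is only that such a mixture has a variance larger than its mean, i.e. it is broader than a single Poisson distribution.

```python
import math


def expected_goals(goal_difference, goal_sum):
    """Split expected difference and sum into the expected goals of both teams."""
    return (goal_sum + goal_difference) / 2.0, (goal_sum - goal_difference) / 2.0


def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson process with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def mixture_pmf(k, means):
    """Goal distribution averaged over team-match pairs with different means."""
    return sum(poisson_pmf(k, m) for m in means) / len(means)


# Hypothetical (expected goal difference, expected goal sum) for three matches.
means = []
for diff, total in [(1.2, 3.2), (0.5, 3.0), (-0.4, 2.8)]:
    means.extend(expected_goals(diff, total))

goals = range(0, 40)  # truncation of the sums; the neglected tail is negligible
mix_mean = sum(k * mixture_pmf(k, means) for k in goals)
mix_var = sum((k - mix_mean) ** 2 * mixture_pmf(k, means) for k in goals)
```

For a single Poisson distribution the variance equals the mean; here the excess `mix_var - mix_mean` equals the variance of the expectation values among the team-match pairs.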

FIG. 4: (a) The probability distribution of the goal difference per match together with its estimation based on independent Poisson processes of both teams. In (b) it is shown for different scores how the ratio of the estimated and the actual number of draws differs from unity.



As already discussed in the literature the number of draws is somewhat larger than expected on the basis of independent Poisson distributions; see, e.g., Refs. [10], [11]. As an application of the present results we quantify this statement. In Fig. 4 we compare the calculated distribution of $\Delta g_{ij}$ with the actual values. The agreement is very good except for $\Delta g_{ij} = -1, 0, 1$. Thus, the simple picture of independent goals of the home and the away team is slightly invalidated. The larger number of draws is balanced by a reduction of the number of matches with exactly one goal difference. More specifically, we have calculated the relative increase of draws for the different results. The main effect is due to the strong increase of more than 20% of the 0:0 draws. Note that the present analysis has already taken into account the fitness distribution for the estimation of this number. Starting from 3:3 the simple picture of independent home and away goals holds again.
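
For reference, the goal-difference distribution expected for independent Poisson goals can be computed by direct summation, as in the following sketch. The expectation values for the home and away team are hypothetical, and the sums are truncated at 30 goals, which is far beyond any relevant score.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson process with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def goal_difference_pmf(diff, mean_home, mean_away, max_goals=30):
    """P(g_home - g_away = diff) for independent Poisson goals (truncated sum)."""
    return sum(
        poisson_pmf(g, mean_home) * poisson_pmf(g - diff, mean_away)
        for g in range(max(diff, 0), max_goals + 1)
    )


mean_home, mean_away = 1.75, 1.25   # hypothetical expectation values
p_draw = goal_difference_pmf(0, mean_home, mean_away)
p_nil_nil = poisson_pmf(0, mean_home) * poisson_pmf(0, mean_away)
total = sum(goal_difference_pmf(d, mean_home, mean_away) for d in range(-30, 31))
```

Comparing such probabilities, averaged over all matches with their estimated expectation values, with the observed frequencies gives the ratios discussed for the individual draw results.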

The three major contributions to the final soccer result display a clear hierarchy, i.e. $\sigma_r^2 : \sigma_q^2 : \sigma_f^2 \approx 10^2 : 10^1 : 10^0$. $\sigma_f^2$, albeit well defined and quantifiable, can be neglected for two reasons. First, it is small as compared to the fitness variation among different teams. Second, the uncertainty in the prediction of $q_{ij}$ is, even at the end of the season, significantly larger (variance of the uncertainty: $2 \cdot \sigma^2_{e,N=33} = 0.12$, see above). Thus, the limit of predictability of a soccer match is, beyond the random effects, mainly related to the uncertainty in the fitness determination rather than to match-specific effects. Hence, the hypothesis of a strictly constant team fitness during a season, even on a single-match level, cannot be refuted even for a data set comprising more than 20 years. In disagreement with this observation soccer reports in the media often stress that a team played particularly well or badly. Our results suggest that there exists a strong tendency to relate the assessment too much to the final result, thereby ignoring the large amount of random aspects of a match.

In summary, apart from the minor correlations with respect to the number of draws soccer is a surprisingly simple game in statistical terms. Neglecting the minor differences between a Poissonian and binomial distribution and the slight tendency towards a draw, a soccer match is equivalent to two teams throwing a die. The number 6 means goal and the number of attempts of both teams is fixed already at the beginning of the match, reflecting their respective fitness in that season.
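
The die picture can be made concrete by comparing the binomial distribution of the number of sixes with the Poisson distribution of the same mean. In this sketch the number of throws (9, corresponding to 1.5 expected goals) is a hypothetical choice; the comparison merely illustrates how minor the differences between the two distributions are.

```python
import math


def binomial_pmf(k, attempts, p):
    """Probability of k successes in a fixed number of independent attempts."""
    return math.comb(attempts, k) * p ** k * (1.0 - p) ** (attempts - k)


def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson process with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


attempts = 9          # hypothetical number of throws for one team
p_six = 1.0 / 6.0     # probability that a throw shows a 6, i.e. a goal
mean_goals = attempts * p_six

# Largest difference between both distributions over all possible goal numbers.
max_deviation = max(
    abs(binomial_pmf(k, attempts, p_six) - poisson_pmf(k, mean_goals))
    for k in range(attempts + 1)
)
```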

More generally speaking, our approach may serve as a general framework to classify different types of sports in a three-dimensional parameter space, expressed by $\sigma_q^2, \sigma_f^2, \sigma_r^2$. This set of numbers, e.g., determines the degree of competitiveness [?]. For example, for matches between just two persons (e.g. tennis) one would expect that fitness fluctuations ($\sigma_f^2$) play a much bigger role and that for sports events with many goals or points (e.g. basketball) the random effects ($\sigma_r^2$) are much less pronounced, i.e. it is more likely that the stronger team indeed wins. Hopefully, the present work stimulates activities to characterize different types of sports along these lines.

We gratefully acknowledge helpful discussions with B. Strauss, M. Trede, and M. Tolan about this topic.